Note: This week we are working with the Week 5 data again and new data for Week 6. For both of these data sets, you have the option of working with a “small” file if it takes a lot of time for code to run on your computer. The file names are wk5_small.csv and wk6_small.csv.
In this week’s notebook, you will learn how to flip the coordinates of a graph and animate a set of graphs to show change over time. We will use these tools to make population pyramids. In your lab, you will use population pyramids to visualize the effects of changing immigration policies across the twentieth century.
To see how coordinate flipping works, we are going to start by going back to last week’s data on women’s occupations. Load the tidyverse package, read in the data for Week 5, and remove records for Alaska and Hawaii before 1960.
library(tidyverse)
ipums <- read_csv("wk5.csv")
ipums<-ipums %>% filter(!STATEFIP %in% c(2,15) & YEAR < 1961)
The following code makes the basic column graph of women’s occupations over time that we began with last week, including only women with an occupation listed.
women <- ipums %>% mutate(OCCUP = OCC1950 %/% 100 + 1) %>%
mutate(OCCUP = ifelse(OCCUP == 9, 2,
ifelse(OCC1950 %in% 910:979, 9, OCCUP))) %>%
filter(OCCUP < 10) %>% mutate(OCCUP = factor(OCCUP, labels = c("Prof/Tech", "Farming",
"Managers", "Clerical", "Sales", "Crafts", "Operatives", "Service", "Laborers"))) %>%
filter(SEX == 2 & AGE >= 15 & AGE <= 64) %>%
group_by(YEAR, OCCUP) %>% count(wt = PERWT)
occplot <- ggplot(women, aes(YEAR, n, fill = OCCUP)) + geom_col() +
scale_x_continuous(breaks = c(1900, 1930, 1960, 1990)) +
scale_y_continuous(labels = scales::comma) +
scale_fill_brewer(palette = "Set3") +
theme_minimal(base_size = 20) +
labs(x = "Census Year", y = "Number of Women", fill = "Occupational Category",
title = "Occupational Change for Women Aged 15-64 with Occupation Listed")
occplot
One problem with this kind of plot is that the size of any given category affects the position of the other categories, so it can be hard to compare how a given category has changed over time. For example, did the professional and technical sector grow between 1900 and 1910, or does it just look that way because the bar is higher in the 1910 column? One way to enhance comparability within columns is to facet on occupational categories, either instead of or in addition to using color to denote occupational categories. Do this by adding a facet_wrap() layer to your plot.
occplot + facet_wrap(vars(OCCUP)) + guides(fill = "none")
This helps us compare change within each category over time, but it is now much harder to compare the size of each category at a given point in time. What if we facet by year and put occupational categories on the x-axis? Save this new plot as occplot2 and print it.
occplot <- ggplot(women, aes(YEAR, n, fill = OCCUP)) + geom_col() +
scale_x_continuous(breaks = c(1900, 1930, 1960, 1990)) +
scale_y_continuous(labels = scales::comma) +
scale_fill_brewer(palette = "Set3") +
theme_minimal(base_size = 20) +
labs(x = "Census Year", y = "Number of Women", fill = "Occupational Category",
title = "Occupational Change for Women Aged 15-64 with Occupation Listed")
occplot2 <- occplot <- ggplot(women, aes(OCCUP, n, fill = OCCUP)) + geom_col() +
scale_y_continuous(labels = scales::comma) +
scale_fill_brewer(palette = "Set3") +
theme_minimal(base_size = 20) +
labs(x = "Occupational Category", y = "Number of Women", fill = "Occupational Category",
title = "Occupational Change for Women Aged 15-64 with Occupation Listed") + facet_wrap(vars(YEAR)) + guides(fill = "none")
occplot2
Now we can see the size of different occupational categories in any given year, but it is much harder to see changes in those categories over time. We also can’t see what the categories are. We can fix both of these things by simply turning the plot on its side, which we do by adding the coord_flip() layer.
occplot2 + coord_flip()
It is still hard to read the labels across the bottom, but we can change the y-axis labels to count by millions of women.
occplot2 + coord_flip() +
scale_y_continuous(breaks = seq(0, 20000000, 5000000), labels = seq(0, 20, 5)) +
labs(y = "Millions of Women")
Scale for 'y' is already present. Adding another scale for 'y', which
will replace the existing scale.
Here we can see the enormous growth of female labor force participation over the twentieth century. It is harder to see changes in the distribution of working women across occupational categories. Instead of plotting the number of women in each category, we can plot the percent of working women in each category. To do this, we need to add a column to our women data frame that includes the total number of working women in each year.
women
women <- women %>% group_by(YEAR) %>% mutate (TOTAL = sum(n))
women
Copy and paste your code for occplot2 here, and adjust it to graph percents rather than total numbers. You will need to change the value plotted on the y-axis, the scale on the y-axis, and the y-axis label. Don’t forget to add the coord_flip() layer.
ggplot(women, aes(OCCUP, n/TOTAL, fill = OCCUP)) + geom_col() +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "Set3") +
theme_minimal(base_size = 20) +
labs(x = "Occupational Category", y = "Number of Women", fill = "Occupational Category",
title = "Occupational Change for Women Aged 15-64 with Occupation Listed") + facet_wrap(vars(YEAR)) + guides(fill = "none") + coord_flip()
occplot2
Now that we know how to flip our coordinates, we can make population pyramids like this one, which was widely regarded as the best data visualization of 2014. A population pyramid is a visualization demographers use to quickly see the age-sex structure of a population. Here we will use it to see the effects on the U.S. population of changing immigration laws across the twentieth century. Read in the data for Week 6. We won’t be using the Week 5 data again, so you can save the Week 6 data frame as ipums.
ipums <- read_csv("wk6.csv")
Check to see which samples and variables we are working with
for (year in unique(ipums$YEAR)) {
print(year)
print(summary(filter(ipums, YEAR == year)$PERWT))
}
[1] 1900
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 100.0 100.0 100.1 101.0 101.0
[1] 1910
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 95.83 100.29 99.96 104.40 196.02
[1] 1920
Min. 1st Qu. Median Mean 3rd Qu. Max.
65.76 97.64 100.60 100.91 103.89 177.50
[1] 1930
Min. 1st Qu. Median Mean 3rd Qu. Max.
65.40 99.65 100.75 100.90 101.95 278.85
[1] 1940
Min. 1st Qu. Median Mean 3rd Qu. Max.
19.00 100.00 100.00 96.43 100.00 100.00
[1] 1950
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 48.00 66.00 79.19 83.00 330.00
[1] 1960
Min. 1st Qu. Median Mean 3rd Qu. Max.
99.00 99.00 100.00 99.61 100.00 100.00
[1] 1970
Min. 1st Qu. Median Mean 3rd Qu. Max.
100 100 100 100 100 100
[1] 1980
Min. 1st Qu. Median Mean 3rd Qu. Max.
100 100 100 100 100 100
[1] 1990
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 69.00 96.00 99.25 126.00 1728.00
names(ipums)
[1] "YEAR" "STATEFIP" "PERWT" "SEX" "AGE" "BPL"
[7] "BPLD"
ipums<-ipums %>% filter(YEAR > 1950 | !STATEFIP %in% c(2,15))
ipums %>% filter(YEAR < 1960 & STATEFIP %in% c(2,15))
For now, we are going to use our population pyramids to compare the native-born and foreign-born segments of the U.S. population. In your lab, you will further subdivide the foreign-born segment. Take a look at the documentation for BPL and BPLD and create a new variable called NATIVITY that indicates whether or not a person was born in the United States. We will be faceting on this variable, so make it a factor variable with meaningful labels. While you are at it, also factor SEX, creating a new variable SEXF. Test that you have done this correctly.
ipums <- ipums %>% mutate(NATIVITY = factor(BPL >= 100,labels = c("Born in the U.S.", "Not Born in the U.S.")), SEXF=factor(SEX,labels=c("Male","Female")))
table(ipums$BPL,ipums$NATIVITY)
Born in the U.S. Not Born in the U.S.
1 362718 0
2 9829 0
4 64398 0
5 234079 0
6 610467 0
8 113877 0
9 161466 0
10 31205 0
11 55157 0
12 206434 0
13 426131 0
15 30591 0
16 49905 0
17 799692 0
18 402868 0
19 307767 0
20 209584 0
21 372717 0
22 295829 0
23 100239 0
24 211827 0
25 410258 0
26 522788 0
27 297540 0
28 279438 0
29 435361 0
30 50368 0
31 151892 0
32 15306 0
33 53500 0
34 368026 0
35 66316 0
36 1215471 0
37 442390 0
38 71598 0
39 746351 0
40 216686 0
41 94684 0
42 1056258 0
44 65878 0
45 260553 0
46 71256 0
47 372093 0
48 753891 0
49 81446 0
50 46225 0
51 360287 0
53 149856 0
54 208494 0
55 353311 0
56 21856 0
90 54 0
99 146502 0
100 0 369
105 0 1302
110 0 38627
115 0 1345
120 0 774
150 0 110756
155 0 3
160 0 1176
199 0 25
200 0 105540
210 0 17329
250 0 19991
260 0 23746
300 0 21484
400 0 11783
401 0 8388
402 0 380
403 0 4
404 0 22842
405 0 36830
410 0 66511
411 0 23038
412 0 4358
413 0 1642
414 0 73693
419 0 4
420 0 4776
421 0 13114
422 0 28
423 0 513
424 0 10
425 0 11579
426 0 8575
429 0 16
430 0 636
431 0 16
432 0 34
433 0 14960
434 0 123706
435 0 278
436 0 11356
437 0 19
438 0 5641
440 0 1
450 0 40986
451 0 810
452 0 22784
453 0 150452
454 0 24935
455 0 69365
456 0 8103
457 0 12927
458 0 102
459 0 8
460 0 698
461 0 2320
462 0 9853
463 0 7
465 0 83221
499 0 807
500 0 20024
501 0 13569
502 0 9961
509 0 3
510 0 4
511 0 1372
512 0 815
513 0 2113
514 0 401
515 0 19504
516 0 236
517 0 1800
518 0 7928
519 0 19
520 0 273
521 0 9499
522 0 3469
523 0 1
524 0 34
530 0 28
531 0 215
532 0 665
533 0 4
534 0 2543
535 0 602
536 0 142
537 0 1935
538 0 5
539 0 12
540 0 333
541 0 3108
542 0 5493
543 0 18
544 0 70
545 0 15
546 0 14
547 0 29
548 0 466
549 0 24
550 0 1
599 0 1315
600 0 6909
700 0 2694
710 0 1168
800 0 15
900 0 29139
999 0 9593
table(ipums$SEX, ipums$SEXF)
Male Female
1 7829017 0
2 0 7985852
Population pyramids typically show the size of the population by age group rather than each individual age. Since we are working with data at ten-year intervals, we need to recode AGE into a new factor variable called AGEG that represents ten-year age groups (0-9 through 80+). We can use some programming magic to make it easier. First we make a vector of labels for our new variable. Alternatively, we could have typed out each individual label.
agegrp <- "0-9"
for (n in (1:7)) {
agegrp <- c(agegrp, paste(n, "0-", n, "9", sep = ""))
}
agegrp <- c(agegrp, "80+")
agegrp
[1] "0-9" "10-19" "20-29" "30-39" "40-49" "50-59" "60-69" "70-79"
[9] "80+"
Now we factor AGE and apply the labels. Using integer division saves us a lot of typing.
ipums <- ipums %>% mutate(AGEG = factor(ifelse(AGE >= 80, 8, AGE %/% 10), labels = agegrp))
table(ipums$AGE, ipums$AGEG)
0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80+
0 296106 0 0 0 0 0 0 0 0
1 300286 0 0 0 0 0 0 0 0
2 308872 0 0 0 0 0 0 0 0
3 309054 0 0 0 0 0 0 0 0
4 298196 0 0 0 0 0 0 0 0
5 303590 0 0 0 0 0 0 0 0
6 302839 0 0 0 0 0 0 0 0
7 303566 0 0 0 0 0 0 0 0
8 301446 0 0 0 0 0 0 0 0
9 298368 0 0 0 0 0 0 0 0
10 0 301926 0 0 0 0 0 0 0
11 0 286632 0 0 0 0 0 0 0
12 0 298098 0 0 0 0 0 0 0
13 0 287082 0 0 0 0 0 0 0
14 0 281265 0 0 0 0 0 0 0
15 0 278583 0 0 0 0 0 0 0
16 0 280431 0 0 0 0 0 0 0
17 0 275553 0 0 0 0 0 0 0
18 0 275721 0 0 0 0 0 0 0
19 0 265586 0 0 0 0 0 0 0
20 0 0 260800 0 0 0 0 0 0
21 0 0 255791 0 0 0 0 0 0
22 0 0 255426 0 0 0 0 0 0
23 0 0 252500 0 0 0 0 0 0
24 0 0 247152 0 0 0 0 0 0
25 0 0 252613 0 0 0 0 0 0
26 0 0 246913 0 0 0 0 0 0
27 0 0 244719 0 0 0 0 0 0
28 0 0 245230 0 0 0 0 0 0
29 0 0 240021 0 0 0 0 0 0
30 0 0 0 258599 0 0 0 0 0
31 0 0 0 218732 0 0 0 0 0
32 0 0 0 238856 0 0 0 0 0
33 0 0 0 226011 0 0 0 0 0
34 0 0 0 221096 0 0 0 0 0
35 0 0 0 234080 0 0 0 0 0
36 0 0 0 218790 0 0 0 0 0
37 0 0 0 213481 0 0 0 0 0
38 0 0 0 218421 0 0 0 0 0
39 0 0 0 208896 0 0 0 0 0
40 0 0 0 0 227578 0 0 0 0
41 0 0 0 0 182857 0 0 0 0
42 0 0 0 0 205371 0 0 0 0
43 0 0 0 0 189424 0 0 0 0
44 0 0 0 0 174553 0 0 0 0
45 0 0 0 0 192458 0 0 0 0
46 0 0 0 0 171622 0 0 0 0
47 0 0 0 0 171298 0 0 0 0
48 0 0 0 0 170248 0 0 0 0
49 0 0 0 0 167908 0 0 0 0
50 0 0 0 0 0 181565 0 0 0
51 0 0 0 0 0 143690 0 0 0
52 0 0 0 0 0 157917 0 0 0
53 0 0 0 0 0 145010 0 0 0
54 0 0 0 0 0 145420 0 0 0
55 0 0 0 0 0 145939 0 0 0
56 0 0 0 0 0 135290 0 0 0
57 0 0 0 0 0 130243 0 0 0
58 0 0 0 0 0 129714 0 0 0
59 0 0 0 0 0 129099 0 0 0
60 0 0 0 0 0 0 134659 0 0
61 0 0 0 0 0 0 109558 0 0
62 0 0 0 0 0 0 115416 0 0
63 0 0 0 0 0 0 109924 0 0
64 0 0 0 0 0 0 107751 0 0
65 0 0 0 0 0 0 114407 0 0
66 0 0 0 0 0 0 97469 0 0
67 0 0 0 0 0 0 95622 0 0
68 0 0 0 0 0 0 91103 0 0
69 0 0 0 0 0 0 87365 0 0
70 0 0 0 0 0 0 0 87672 0
71 0 0 0 0 0 0 0 71776 0
72 0 0 0 0 0 0 0 72796 0
73 0 0 0 0 0 0 0 66875 0
74 0 0 0 0 0 0 0 63224 0
75 0 0 0 0 0 0 0 61408 0
76 0 0 0 0 0 0 0 52574 0
77 0 0 0 0 0 0 0 48187 0
78 0 0 0 0 0 0 0 43759 0
79 0 0 0 0 0 0 0 41093 0
80 0 0 0 0 0 0 0 0 37664
81 0 0 0 0 0 0 0 0 29765
82 0 0 0 0 0 0 0 0 27474
83 0 0 0 0 0 0 0 0 24083
84 0 0 0 0 0 0 0 0 21397
85 0 0 0 0 0 0 0 0 18359
86 0 0 0 0 0 0 0 0 14974
87 0 0 0 0 0 0 0 0 12853
88 0 0 0 0 0 0 0 0 10114
89 0 0 0 0 0 0 0 0 8779
90 0 0 0 0 0 0 0 0 19727
91 0 0 0 0 0 0 0 0 1758
92 0 0 0 0 0 0 0 0 1435
93 0 0 0 0 0 0 0 0 1094
94 0 0 0 0 0 0 0 0 791
95 0 0 0 0 0 0 0 0 658
96 0 0 0 0 0 0 0 0 451
97 0 0 0 0 0 0 0 0 324
98 0 0 0 0 0 0 0 0 277
99 0 0 0 0 0 0 0 0 242
100 0 0 0 0 0 0 0 0 1386
101 0 0 0 0 0 0 0 0 19
102 0 0 0 0 0 0 0 0 11
103 0 0 0 0 0 0 0 0 15
104 0 0 0 0 0 0 0 0 8
105 0 0 0 0 0 0 0 0 11
106 0 0 0 0 0 0 0 0 3
108 0 0 0 0 0 0 0 0 6
109 0 0 0 0 0 0 0 0 3
110 0 0 0 0 0 0 0 0 6
111 0 0 0 0 0 0 0 0 2
[ reached getOption("max.print") -- omitted 7 rows ]
We eventually want to animate our population pyramid, but we will begin by using facets for year as well as for nativity. For each year, we want the number of people in each age/sex/nativity category as a percentage of the total population in that year.
agedata <- ipums %>% group_by(YEAR) %>%
count(NATIVITY, SEXF, AGEG, wt = PERWT) %>%
mutate(TOTAL = sum(n))
head(agedata)
tail(agedata)
In the population pyramid, we want the number of men to plot on the left. To make that happen, we express the numbers of men as a negative.
agedata <- agedata %>% mutate(n=ifelse(SEXF=="Male",0-n,n))
head(agedata)
tail(agedata)
Now we can make a column graph, but don’t facet yet. Save the graph as pyramid and print it.
pyramid <- ggplot(agedata, aes(AGEG,n/TOTAL,fill=SEXF)) + geom_col()
pyramid
Flip the coordinates and add facets and all the layers you need to make the plot look good. The Set1 palette from RColorBrewer is a good choice for population pyramids. We can move the legend to the bottom with the theme(legend.position = "bottom") layer.
pyramid + coord_flip() + scale_fill_brewer(palette = "Set1") +
scale_y_continuous(breaks = seq(-.15, .15, .05),
labels = c("15%", "10%","5%", "0","5%", "10%","15%")) +
theme_minimal(base_size = 20) + theme(legend.position = "bottom") +
labs(x = "Age", y = "Percent of Population", fill = "Sex", title = "Population Pyramid") +
facet_grid(rows = vars(YEAR), cols = vars(NATIVITY))
We built this set of pyramids one step at a time so that you could see all of the steps that went into it. We could have also done it in one large block of code. Pull together all of the code to make this plot, starting with the original wk6.csv file, into a single block. Don’t include the tests that you used to check your recoding. Add base_size = 25 to your theme_minimal() layer to increase the size of the text. Save the results of the whole block as pyramids.
agegrp <- "0-9"
for (n in (1:7)) {
agegrp <- c(agegrp, paste(n, "0-", n, "9", sep = ""))
}
agegrp <- c(agegrp, "80+")
agegrp
pyramids <- read_csv ("wk6.csv") %>% filter(YEAR > 1950 | !STATEFIP %in% c(2,15)) %>% mutate(NATIVITY = factor(BPL >= 100, labels = c("Born in the U.S.", "Not born in the U.S.")), SEXF = factor(SEX, labels=c("Male", "Female")),AGEG = factor(ifelse(AGE >= 80,8,AGE %/% 10), labels=agegrp)) %>% group_by (YEAR) %>% count(SEXF,NATIVITY,AGEG,wt=PERWT) %>% mutate(TOTAL=sum(n),n=ifelse(SEXF=="Male",0-n,n)) %>% ggplot(aes(AGEG,n/TOTAL,fill=SEXF)) + geom_col() + coord_flip() + scale_fill_brewer(palette="Set2") + scale_y_continuous(breaks = seq(-.15, .15, .05),
labels = c("15%", "10%","5%", "0","5%", "10%","15%")) +
theme_minimal(base_size = 25) + theme(legend.position = "bottom") +
labs(x = "Age", y = "Percent of Population", fill = "Sex", title = "Population Pyramid") +
facet_grid(cols = vars(NATIVITY))
pyramids
Once you have that working, remove the vertical facet for YEAR, so all years are plotting together. Make sure this is the object that has been saved as pyramids. We want to animate change over time rather than displaying it in facets. To animate a ggplot2 graph, we need to install and load the gganimate and gifski packages. Then, we just need to add one more layer to our pyramids plot.
library(gganimate)
library(gifski)
pyramids + transition_manual(YEAR)
anim_save("pyramid.gif")
The transition_manual() function indicates which variable to use as the time dimension. The anim_save() function saves the animation as a .gif. You can call it whatever you want. It will not appear in your editor window, but you should see it in the Viewer tab in the lower right window in RStudio.
The animation would be more useful if the title indicated which year we were seeing at any given moment. You can do that by adding another labs() layer to replace the title. I have also added an options() function to increase the resolution of the final animation.
options(gganimate.dev_args = list(width = 1800, height = 860))
pyramids + transition_manual(YEAR) + labs(title = "Population Pyramid, {current_frame}")
anim_save("pyramid.gif")
In order to see the animation in your output, you need to add { width=100% } in a markdown block (it’s ok if it doesn’t appear in your editor window; you should see it when you click Preview). Put it below:
Save and preview, then upload to Canvas. You are now ready for Lab 6!